Introduce atomic slot migration #1949
Merged: madolson merged 125 commits into valkey-io:unstable from enjoy-binbin:slot_migration_murphyjacob4 on Aug 12, 2025.
Conversation
1. Define new structures slotRange and clusterSlotSyncLink; 2. Add a CLUSTER SLOTLINK command to manage all the slot sync links. Signed-off-by: Binbin <[email protected]>
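A rough sketch of what these two structures could look like; the type names come from the commit above, but every field here is a guess for illustration, not the actual definition:
```c
/* Inclusive range of hash slots, e.g. {0, 5460}. */
typedef struct slotRange {
    int start_slot;
    int end_slot;
} slotRange;

/* One slot sync link, managed via CLUSTER SLOTLINK. Fields are
 * illustrative; the real structure lives in the cluster implementation. */
typedef struct clusterSlotSyncLink {
    connection *conn;   /* transport to the peer node */
    slotRange *ranges;  /* slot ranges carried over this link */
    int num_ranges;
    int state;          /* handshake, RDB transfer, or streaming */
    long long last_io;  /* for timeout detection in the cron */
} clusterSlotSyncLink;
```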
1. Add CLUSTER SLOTSYNC/SLOTSYNCFORCE commands to trigger slot sync. Signed-off-by: Binbin <[email protected]>
1. Extend the SYNC command to let it specify slot ranges; 2. Enable filtering of keys in the specified slots when generating the RDB; 3. Implement the handshake process before RDB transfer for slot sync. Signed-off-by: Binbin <[email protected]>
1. Implement the RDB transfer and loading for slot sync. Signed-off-by: Binbin <[email protected]>
1. Enable filtering of commands in the specified slots when feeding replicas; 2. Implement the message exchange for slot sync; 3. Add clusterSlotSyncCron() to handle time events for slot sync. Signed-off-by: Binbin <[email protected]>
1. Add CLUSTER FAILOVER command to trigger slot failover; 2. Implement the process of slot failover. Signed-off-by: Binbin <[email protected]>
1. Improve delDbKeysInSlot() to support a time limit; 2. Implement slot pending delete. Signed-off-by: Binbin <[email protected]>
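A sketch of the time-limited deletion idea, assuming a hypothetical per-slot key iterator; the real delDbKeysInSlot() may be shaped differently:
```c
/* Delete keys from `slot` until `budget_us` microseconds have elapsed.
 * Returns the number of keys deleted; the cron calls this repeatedly,
 * so a partially deleted slot stays "pending delete" between ticks. */
long long delDbKeysInSlot(serverDb *db, int slot, long long budget_us) {
    long long deleted = 0;
    long long start = ustime();
    robj *key;
    while ((key = nextKeyInSlot(db, slot)) != NULL) { /* hypothetical iterator */
        dbDelete(db, key);
        deleted++;
        /* Check the clock only every 128 deletions to keep it cheap. */
        if ((deleted & 127) == 0 && ustime() - start > budget_us) break;
    }
    return deleted;
}
```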
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
…ot RDBs In slot sync, we can load multiple slot RDBs from different nodes. Slot RDBs will contain functions, which may hit an `already exists` error when loading, since every node has the function data. This is a limitation of our slot RDB design. This commit adds a new RDBFLAGS_SLOT_SYNC flag as a hint that we are loading a slot RDB. With this flag, loading a function is treated like FUNCTION LOAD REPLACE, so there are no errors and the function loaded last wins (which is not a problem at the moment, since all the functions are the same across the cluster). Signed-off-by: Binbin <[email protected]>
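A minimal sketch of that hint; apart from RDBFLAGS_SLOT_SYNC and the FUNCTION LOAD REPLACE semantics named in the commit, the helper names are illustrative:
```c
/* When loading a function library from a slot RDB, allow replacement:
 * every node ships the same libraries, so "last one wins" is safe. */
static int rdbLoadFunctionLibrary(const char *code, int rdbflags, sds *err) {
    int replace = (rdbflags & RDBFLAGS_SLOT_SYNC) != 0;
    /* Behaves like FUNCTION LOAD [REPLACE]: with replace=1, an existing
     * library with the same name is overwritten instead of erroring. */
    return functionsLoadLibrary(code, replace, err); /* hypothetical helper */
}
```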
We have three nodes: A and B are doing the slot sync, where A is the src node and B is the dst node. Before we do the CLUSTER SLOTFAILOVER, we add a new node C as a replica of B. At this point, if the connection between A and B breaks, then from B's point of view, freeClient will call onSlotSyncClientClose to reset the link, and B will try a new slot SYNC in cron. B will then call delkeysNotOwnedByMySelf to delete all the keys in that slot, which propagates to C, and C will become an empty DB. When B does the new slot SYNC, A generates a new slot RDB and B reloads it. But this new RDB file is not propagated to node C, because the primary-replica connection between B and C is still healthy, which results in data loss on C. In this commit, after the primary loads the slot RDB, we disconnect all replicas and let them fully synchronize (we don't support slot psync). Signed-off-by: Binbin <[email protected]>
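A minimal sketch of the fix described above, assuming a disconnectSlaves()-style helper; the exact call site and names are illustrative:
```c
/* After the target primary finishes loading a slot RDB, force all of its
 * replicas to reconnect and full-sync: slot psync is not supported, so
 * their replication streams no longer match the loaded data set. */
if (rdbflags & RDBFLAGS_SLOT_SYNC) {
    serverLog(LL_NOTICE,
              "Slot RDB loaded; disconnecting replicas to force full sync");
    disconnectSlaves(); /* replicas reconnect and request a full SYNC */
}
```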
In the SET command, the expire time is rewritten into a new
robj, which is mostly an INT:
```
/* Propagate as SET Key Value PXAT millisecond-timestamp if there is
 * EX/PX/EXAT/PXAT flag. */
robj *milliseconds_obj = createStringObjectFromLongLong(milliseconds);
rewriteClientCommandVector(c, 5, shared.set, key, val, shared.pxat, milliseconds_obj);
```
Then, if we are doing resharding, the server will crash.
That is because we call isCommandInSlotRanges to check
the slot, which calls getKeysFromCommand to get the keys
from the command argv. The milliseconds_obj is an INT, but the code
in setGetKeys treats it as a STRING, so the server crashes.
In this fix, we decided that in isCommandInSlotRanges, if we
find an INT, we decode it to a STRING.
In addition, we added a new debug option enable-debug-assert so that
we can cover the isCommandInSlotRanges function in our
TCL tests.
Signed-off-by: Binbin <[email protected]>
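A simplified sketch of that fix: normalize any INT-encoded argv objects to plain strings before key extraction. getDecodedObject() and OBJ_ENCODING_INT are standard helpers; the surrounding lookup logic is omitted:
```c
/* Inside isCommandInSlotRanges() (simplified): command rewrites such as
 * SET ... PXAT can leave OBJ_ENCODING_INT objects in argv, which
 * string-based key extraction cannot handle. Decode them first. */
for (int i = 0; i < c->argc; i++) {
    if (c->argv[i]->encoding == OBJ_ENCODING_INT) {
        robj *decoded = getDecodedObject(c->argv[i]); /* new string robj */
        decrRefCount(c->argv[i]);
        c->argv[i] = decoded;
    }
}
/* ...getKeysFromCommand() can now safely treat argv entries as strings. */
```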
Signed-off-by: Binbin <[email protected]>
Let's take expire as an example. If the target node deleted an expired key, or the key logically expires while the slot RDB is being loaded, the source node may propagate the expire command after the slot RDB has been loaded. These expire commands become no-ops on the logically expired keys, resulting in data loss. During the slot migration process, keys are therefore not considered expired in the expiration check for commands propagated by the source node. Signed-off-by: Binbin <[email protected]>
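One way to picture that rule, assuming a hypothetical slotIsBeingImported() predicate; the actual check in the expiration path may be shaped differently:
```c
/* During slot import, never treat a key in an importing slot as expired
 * locally: the source node remains the authority and will propagate an
 * explicit expiry or deletion over the migration link. */
int keyIsExpiredForSlotSync(serverDb *db, robj *key) {
    int slot = keyHashSlot(key->ptr, sdslen(key->ptr));
    if (slotIsBeingImported(slot)) return 0; /* hypothetical predicate */
    return keyIsExpired(db, key);            /* normal lazy-expire check */
}
```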
Signed-off-by: Binbin <[email protected]>
The reply will mess up the slot replication. Signed-off-by: Binbin <[email protected]>
This causes the source node to not trim the reply, making the query buffer of the target node grow too large and resulting in the target node being disconnected due to the querybuf limit (1GB). Signed-off-by: Binbin <[email protected]>
We need to make sure this block is only executed on the source node; otherwise the target node calling it would reset the slot failover. Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
…flags Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
This can happen when multiple targets are doing slot sync at the same time. When doing disk-based slot RDB replication, this replicationSetupReplicaForFullResync call may result in sending the slot RDB to a different target, sending it to a normal replica, or sending a normal RDB to a target. Signed-off-by: Binbin <[email protected]>
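A sketch of the kind of guard this fix implies when a snapshot child finishes: only attach waiters whose expected payload matches what the child produced. The CLIENT_SLOT_SYNC flag and list/constant names approximate the replication code and are illustrative:
```c
/* Called when a disk-based snapshot child finishes. child_was_slot_rdb
 * (illustrative) records whether that child produced a slot RDB or a
 * normal one; only matching waiters get attached to it. */
void attachWaitingReplicas(int child_was_slot_rdb) {
    listIter li;
    listNode *ln;
    listRewind(server.replicas, &li);
    while ((ln = listNext(&li)) != NULL) {
        client *replica = ln->value;
        if (replica->replstate != REPLICA_STATE_WAIT_BGSAVE_END) continue;
        int wants_slot_rdb = (replica->flags & CLIENT_SLOT_SYNC) != 0;
        if (wants_slot_rdb != child_was_slot_rdb) continue; /* wrong payload */
        replicationSetupReplicaForFullResync(replica, getPsyncInitialOffset());
    }
}
```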
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
zuiderkwast pushed a commit that referenced this pull request on Aug 18, 2025:
We now pass rdbSnapshotOptions options into this function, and options.conns is now malloc'ed on the caller side, so we need to zfree it when returning early due to an error. Previously, conns was malloc'ed after the error handling, so we didn't have this issue. Introduced in #1949. Signed-off-by: Binbin <[email protected]>
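The ownership rule from this fix, sketched below; only rdbSnapshotOptions, conns, and zfree() come from the message above, and the function shape and early-exit condition are illustrative:
```c
/* options.conns is malloc'ed by the caller but owned by this function
 * once passed in, so every early-return path must release it. */
int rdbSaveSnapshotToConns(rdbSnapshotOptions options) {
    if (hasActiveChildProcess()) { /* example early-exit condition */
        zfree(options.conns);      /* don't leak the caller's allocation */
        return C_ERR;
    }
    /* ...fork and stream the snapshot, freeing conns on completion... */
    return C_OK;
}
```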
allenss-amazon pushed a commit to allenss-amazon/valkey-core that referenced this pull request on Aug 19, 2025:
Introduces a new family of commands for migrating slots via replication. The procedure is driven by the source node, which pushes an AOF-formatted snapshot of the slots to the target, followed by a replication stream of changes on those slots (a la manual failover). This solution is an adaptation of the solution provided by @enjoy-binbin, combined with the solution I previously posted at valkey-io#1591, modified to meet the designs we had outlined in valkey-io#23.

## New commands
* `CLUSTER MIGRATESLOTS SLOTSRANGE start end [start end]... NODE node-id`: Begin sending the slots via replication to the target. Multiple targets can be specified by repeating `SLOTSRANGE ... NODE ...`
* `CLUSTER CANCELMIGRATION ALL`: Cancel all slot migrations
* `CLUSTER GETSLOTMIGRATIONS`: See a recent log of migrations

This PR only implements "one shot" semantics with an asynchronous model. Later, "two phase" semantics (e.g. slot-level replicate/failover commands) can be added on the same core.

## Slot migration jobs
Introduces the concept of a slot migration job. While active, a job tracks a connection created by the source to the target, over which the contents of the slots are sent. This connection is used for control messages as well as replicated slot data. Each job is given a 40-character random name to help uniquely identify it. All jobs, including those that finished recently, can be observed using the `CLUSTER GETSLOTMIGRATIONS` command.

## Replication
* Since the snapshot uses AOF format, it can be replayed verbatim to any replicas of the target node.
* We use the same proxying mechanism used for chained replication to copy the content sent by the source node directly to the replica nodes.

## `CLUSTER SYNCSLOTS`
To coordinate the state machine transitions across the two nodes, a new command, `CLUSTER SYNCSLOTS`, performs this control flow. Each end of the slot migration connection is expected to install a read handler in order to handle `CLUSTER SYNCSLOTS` commands:
* `ESTABLISH`: Begins a slot migration. Provides slot migration information to the target and authorizes the connection to write to unowned slots.
* `SNAPSHOT-EOF`: Appended to the end of the snapshot to signal that the snapshot is done being written to the target.
* `PAUSE`: Informs the source node to pause whenever it gets the opportunity.
* `PAUSED`: Added to the end of the client output buffer when the pause is performed. The pause is only performed after the buffer shrinks below a configurable size.
* `REQUEST-FAILOVER`: Requests that the source grant or deny a failover for the slot migration. The failover is only granted if the target is still paused. Once a failover is granted, the pause is refreshed for a short duration.
* `FAILOVER-GRANTED`: Sent to the target to inform it that REQUEST-FAILOVER was granted.
* `ACK`: Heartbeat command used to ensure liveness.

## Interactions with other commands
* FLUSHDB on the source node (which flushes the migrating slots) will result in the source dropping the connection, which will flush the slots on the target and reset the state machine back to the beginning. The subsequent retry should succeed very quickly (the slots are now empty).
* FLUSHDB on the target will fail the slot migration. We can iterate with better handling, but for now it is expected that the operator would retry.
* Generally, FLUSHDB is expected to be executed cluster-wide, so preserving partially migrated slots doesn't make much sense.
* SCAN and KEYS are filtered to avoid exposing importing slot data.

## Error handling
* On any transient connection drop, the migration is failed and the user must retry.
* If there is an OOM while reading from the import connection, we fail the import, which drops the importing slot data.
* If a client output buffer limit is reached on the source node, it drops the connection, which causes the migration to fail.
* If at any point the export loses ownership, or either node is failed over, a callback is triggered on both ends of the migration to fail the import. The import will not reattempt with a new owner.
* The two ends of the migration routinely ping each other with SYNCSLOTS ACK messages. If there is no interaction on the connection for longer than `repl-timeout`, the connection is dropped, resulting in migration failure.
* If a failover happens, we drop keys in all unowned slots. The migration does not persist through failovers and would need to be retried on the new source/target.

## State machine
```
Target/Importing Node State Machine
-----------------------------------
SLOT_IMPORT_WAIT_ACK
  -- ACK --> SLOT_IMPORT_RECEIVE_SNAPSHOT
  -- SNAPSHOT-EOF --> SLOT_IMPORT_WAITING_FOR_PAUSED
  -- PAUSED --> SLOT_IMPORT_FAILOVER_REQUESTED
  -- FAILOVER-GRANTED --> SLOT_IMPORT_FAILOVER_GRANTED
  -- Takeover performed --> SLOT_MIGRATION_JOB_SUCCESS

On any error condition, the job moves to
SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP
  -- Unowned slots cleaned up --> SLOT_MIGRATION_JOB_FAILED

Error conditions:
  1. OOM                      4. FLUSHDB
  2. Slot ownership change    5. Connection lost
  3. Demotion to replica      6. No ACK from source (timeout)

Source/Exporting Node State Machine
-----------------------------------
SLOT_EXPORT_CONNECTING
  -- Connected --> SLOT_EXPORT_AUTHENTICATING
  -- Authenticated --> SLOT_EXPORT_SEND_ESTABLISH
  -- ESTABLISH command written --> SLOT_EXPORT_READ_ESTABLISH_RESPONSE
  -- Full response read (+OK) --> SLOT_EXPORT_WAITING_TO_SNAPSHOT
  -- No other child process --> SLOT_EXPORT_SNAPSHOTTING
  -- Snapshot done --> SLOT_EXPORT_STREAMING
  -- PAUSE --> SLOT_EXPORT_WAITING_TO_PAUSE
  -- Buffer drained --> SLOT_EXPORT_FAILOVER_PAUSED
  -- Failover request granted --> SLOT_EXPORT_FAILOVER_GRANTED
  -- New topology received --> SLOT_MIGRATION_JOB_SUCCESS

On any error condition the job moves to SLOT_MIGRATION_JOB_FAILED;
a user cancellation moves it to SLOT_MIGRATION_JOB_CANCELLED.

Error conditions:
  1. User sends CANCELMIGRATION    7. ERR from ESTABLISH command
  2. Slot ownership change         8. Unpaused before failover completed
  3. Demotion to replica           9. Snapshot failed (e.g. child OOM)
  4. FLUSHDB                      10. No ack from target (timeout)
  5. Connection lost              11. Client output buffer overrun
  6. AUTH failed
```
Signed-off-by: Binbin <[email protected]> Signed-off-by: Jacob Murphy <[email protected]> Signed-off-by: Madelyn Olson <[email protected]> Co-authored-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]> Co-authored-by: Madelyn Olson <[email protected]>
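A usage sketch of the commands above; the port, the slot range, and the node-id placeholder are illustrative, while the command forms come from the description:
```
# Start migrating slots 0-5460 from this node to the given target
valkey-cli -p 6379 CLUSTER MIGRATESLOTS SLOTSRANGE 0 5460 NODE <target-node-id>

# Inspect active and recently finished migration jobs
valkey-cli -p 6379 CLUSTER GETSLOTMIGRATIONS

# Abort all in-flight migrations
valkey-cli -p 6379 CLUSTER CANCELMIGRATION ALL
```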
allenss-amazon pushed a commit to allenss-amazon/valkey-core that referenced this pull request on Aug 19, 2025:
We now pass rdbSnapshotOptions options into this function, and options.conns is now malloc'ed on the caller side, so we need to zfree it when returning early due to an error. Previously, conns was malloc'ed after the error handling, so we didn't have this issue. Introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]>
asagege pushed a commit to asagege/valkey that referenced this pull request on Aug 19, 2025:
We now pass rdbSnapshotOptions options into this function, and options.conns is now malloc'ed on the caller side, so we need to zfree it when returning early due to an error. Previously, conns was malloc'ed after the error handling, so we didn't have this issue. Introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]>
hwware pushed a commit that referenced this pull request on Aug 22, 2025:
This may result in a meaningless slot migration job; we should return an error to the user in advance to avoid operational errors. Also, `by myself` is not correct English grammar and `myself` is an internal code term, so it was changed to `by this node`. Was introduced in #1949. Signed-off-by: Binbin <[email protected]>
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request on Aug 26, 2025:
If all cluster nodes have functions, slot migration will fail, since the target will return a "function already exists" error when doing the FUNCTION LOAD. In addition, the target's replica could panic when it executes the FUNCTION LOAD propagated from the primary (see propagation-error-behavior). Introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]>
enjoy-binbin added a commit that referenced this pull request on Aug 29, 2025:
If all cluster nodes have functions, slot migration will fail, since the target will return a "function already exists" error when doing the FUNCTION LOAD. In addition, the target's replica could panic when it executes the FUNCTION LOAD propagated from the primary (see propagation-error-behavior). Introduced in #1949. Signed-off-by: Binbin <[email protected]>
enjoy-binbin added a commit that referenced this pull request on Aug 29, 2025:
…ous reading of auth response (#2494)
In the old SLOT_EXPORT_AUTHENTICATING state added in #1949, the source node sends the AUTH command and then synchronously reads the response. If the target node is blocked during this process, the source node will also block. We should use a read handler to handle this, so we split SLOT_EXPORT_AUTHENTICATING into SLOT_EXPORT_SEND_AUTH and SLOT_EXPORT_READ_AUTH_RESPONSE to avoid this issue. Signed-off-by: Binbin <[email protected]>
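A sketch of that split; the state names come from the commit message, while the handler plumbing and helper names are illustrative:
```c
/* Fragment of the job state machine dispatch. Non-blocking AUTH: write
 * the command, then let the event loop call us back when the reply
 * arrives instead of blocking on a synchronous read. */
case SLOT_EXPORT_SEND_AUTH:
    connWrite(job->conn, auth_cmd, sdslen(auth_cmd));
    connSetReadHandler(job->conn, slotMigrationReadHandler); /* illustrative */
    job->state = SLOT_EXPORT_READ_AUTH_RESPONSE;
    break;

case SLOT_EXPORT_READ_AUTH_RESPONSE:
    /* Invoked from the read handler once data is available. */
    if (readLineFromConn(job->conn, buf, sizeof(buf)) == C_OK) /* illustrative */
        job->state = (buf[0] == '+') ? SLOT_EXPORT_SEND_ESTABLISH
                                     : SLOT_MIGRATION_JOB_FAILED;
    break;
```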
enjoy-binbin added a commit that referenced this pull request on Sep 18, 2025:
When we added atomic slot migration in #1949, we reused a lot of the RDB save code. That was the easier way to implement ASM at first, but it comes with side effects: we use CHILD_TYPE_RDB to do the fork and the rdb.c/rdb.h functions to save the snapshot, which messes up the logs (we print messages saying we are doing RDB work) and the info fields (we report rdb_bgsave_in_progress while actually doing slot migration). It also makes the code difficult to maintain: the rdb_save path uses a lot of rdb_* variables even though we are doing slot migration. If we want to support one fork with multiple target nodes, we need to rewrite this code for a cleaner separation. Note that the changes to rdb.c/rdb.h revert the earlier changes made when we were reusing this code for slot migration. The slot migration snapshot logic is similar to diskless replication: we use a pipe to transfer the snapshot data from the child process to the parent process (see the sketch below). Interface changes:
- New slot_migration_fork_in_progress info field.
- New cow_size field in the CLUSTER GETSLOTMIGRATIONS command.
- The slot migration fork is also tracked under the cluster latency class.
Signed-off-by: Binbin <[email protected]> Signed-off-by: Jacob Murphy <[email protected]> Co-authored-by: Jacob Murphy <[email protected]>
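A sketch of that diskless-style transfer; the pipe-and-fork shape mirrors the description above, and writeSlotSnapshot() and relaySnapshotToTarget() are illustrative names:
```c
/* The child serializes the migrating slots into a pipe; the parent
 * relays the bytes to the target over the event loop, so no temporary
 * RDB file ever touches the disk. */
int pipefd[2];
if (pipe(pipefd) == -1) return C_ERR;

pid_t childpid = fork();
if (childpid == 0) {
    /* Child: write the slot snapshot into the pipe and exit. */
    close(pipefd[0]);
    writeSlotSnapshot(pipefd[1], job->slot_ranges); /* hypothetical */
    _exit(0);
} else {
    /* Parent: forward pipe output to the target connection as it arrives. */
    close(pipefd[1]);
    aeCreateFileEvent(server.el, pipefd[0], AE_READABLE,
                      relaySnapshotToTarget, job); /* hypothetical callback */
}
```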
rjd15372 pushed a commit to rjd15372/valkey that referenced this pull request on Sep 19, 2025:
We now pass rdbSnapshotOptions options into this function, and options.conns is now malloc'ed on the caller side, so we need to zfree it when returning early due to an error. Previously, conns was malloc'ed after the error handling, so we didn't have this issue. Introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]>
rjd15372 pushed a commit to rjd15372/valkey that referenced this pull request on Sep 19, 2025:
This may result in a meaningless slot migration job; we should return an error to the user in advance to avoid operational errors. Also, `by myself` is not correct English grammar and `myself` is an internal code term, so it was changed to `by this node`. Was introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]>
rjd15372 pushed a commit to rjd15372/valkey that referenced this pull request on Sep 19, 2025:
If all cluster nodes have functions, slot migration will fail, since the target will return a "function already exists" error when doing the FUNCTION LOAD. In addition, the target's replica could panic when it executes the FUNCTION LOAD propagated from the primary (see propagation-error-behavior). Introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]>
rjd15372 pushed a commit to rjd15372/valkey that referenced this pull request on Sep 19, 2025:
…ous reading of auth response (valkey-io#2494)
In the old SLOT_EXPORT_AUTHENTICATING state added in valkey-io#1949, the source node sends the AUTH command and then synchronously reads the response. If the target node is blocked during this process, the source node will also block. We should use a read handler to handle this, so we split SLOT_EXPORT_AUTHENTICATING into SLOT_EXPORT_SEND_AUTH and SLOT_EXPORT_READ_AUTH_RESPONSE to avoid this issue. Signed-off-by: Binbin <[email protected]>
rjd15372 pushed a commit to rjd15372/valkey that referenced this pull request on Sep 19, 2025:
When we added atomic slot migration in valkey-io#1949, we reused a lot of the RDB save code. That was the easier way to implement ASM at first, but it comes with side effects: we use CHILD_TYPE_RDB to do the fork and the rdb.c/rdb.h functions to save the snapshot, which messes up the logs (we print messages saying we are doing RDB work) and the info fields (we report rdb_bgsave_in_progress while actually doing slot migration). It also makes the code difficult to maintain: the rdb_save path uses a lot of rdb_* variables even though we are doing slot migration. If we want to support one fork with multiple target nodes, we need to rewrite this code for a cleaner separation. Note that the changes to rdb.c/rdb.h revert the earlier changes made when we were reusing this code for slot migration. The slot migration snapshot logic is similar to diskless replication: we use a pipe to transfer the snapshot data from the child process to the parent process. Interface changes:
- New slot_migration_fork_in_progress info field.
- New cow_size field in the CLUSTER GETSLOTMIGRATIONS command.
- The slot migration fork is also tracked under the cluster latency class.
Signed-off-by: Binbin <[email protected]> Signed-off-by: Jacob Murphy <[email protected]> Co-authored-by: Jacob Murphy <[email protected]>
rjd15372 pushed a commit that referenced this pull request on Sep 23, 2025:
We now pass rdbSnapshotOptions options into this function, and options.conns is now malloc'ed on the caller side, so we need to zfree it when returning early due to an error. Previously, conns was malloc'ed after the error handling, so we didn't have this issue. Introduced in #1949. Signed-off-by: Binbin <[email protected]>
rjd15372 pushed a commit that referenced this pull request on Sep 23, 2025:
This may result in a meaningless slot migration job; we should return an error to the user in advance to avoid operational errors. Also, `by myself` is not correct English grammar and `myself` is an internal code term, so it was changed to `by this node`. Was introduced in #1949. Signed-off-by: Binbin <[email protected]>
rjd15372 pushed a commit that referenced this pull request on Sep 23, 2025:
If all cluster nodes have functions, slot migration will fail, since the target will return a "function already exists" error when doing the FUNCTION LOAD. In addition, the target's replica could panic when it executes the FUNCTION LOAD propagated from the primary (see propagation-error-behavior). Introduced in #1949. Signed-off-by: Binbin <[email protected]>
rjd15372 pushed a commit that referenced this pull request on Sep 23, 2025:
…ous reading of auth response (#2494)
In the old SLOT_EXPORT_AUTHENTICATING state added in #1949, the source node sends the AUTH command and then synchronously reads the response. If the target node is blocked during this process, the source node will also block. We should use a read handler to handle this, so we split SLOT_EXPORT_AUTHENTICATING into SLOT_EXPORT_SEND_AUTH and SLOT_EXPORT_READ_AUTH_RESPONSE to avoid this issue. Signed-off-by: Binbin <[email protected]>
rjd15372 pushed a commit that referenced this pull request on Sep 23, 2025:
When we added atomic slot migration in #1949, we reused a lot of the RDB save code. That was the easier way to implement ASM at first, but it comes with side effects: we use CHILD_TYPE_RDB to do the fork and the rdb.c/rdb.h functions to save the snapshot, which messes up the logs (we print messages saying we are doing RDB work) and the info fields (we report rdb_bgsave_in_progress while actually doing slot migration). It also makes the code difficult to maintain: the rdb_save path uses a lot of rdb_* variables even though we are doing slot migration. If we want to support one fork with multiple target nodes, we need to rewrite this code for a cleaner separation. Note that the changes to rdb.c/rdb.h revert the earlier changes made when we were reusing this code for slot migration. The slot migration snapshot logic is similar to diskless replication: we use a pipe to transfer the snapshot data from the child process to the parent process. Interface changes:
- New slot_migration_fork_in_progress info field.
- New cow_size field in the CLUSTER GETSLOTMIGRATIONS command.
- The slot migration fork is also tracked under the cluster latency class.
Signed-off-by: Binbin <[email protected]> Signed-off-by: Jacob Murphy <[email protected]> Co-authored-by: Jacob Murphy <[email protected]>
hpatro pushed a commit to hpatro/valkey that referenced this pull request on Oct 3, 2025:
Introduces a new family of commands for migrating slots via replication; the full commit message is quoted above. Signed-off-by: Harkrishn Patro <[email protected]>
hpatro pushed a commit to hpatro/valkey that referenced this pull request on Oct 3, 2025:
We now pass rdbSnapshotOptions options into this function, and options.conns is now malloc'ed on the caller side, so we need to zfree it when returning early due to an error. Previously, conns was malloc'ed after the error handling, so we didn't have this issue. Introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]>
hpatro pushed a commit to hpatro/valkey that referenced this pull request on Oct 3, 2025:
This may result in a meaningless slot migration job; we should return an error to the user in advance to avoid operational errors. Also, `by myself` is not correct English grammar and `myself` is an internal code term, so it was changed to `by this node`. Was introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]>
hpatro pushed a commit to hpatro/valkey that referenced this pull request on Oct 3, 2025:
If all cluster nodes have functions, slot migration will fail, since the target will return a "function already exists" error when doing the FUNCTION LOAD. In addition, the target's replica could panic when it executes the FUNCTION LOAD propagated from the primary (see propagation-error-behavior). Introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]>
hpatro pushed a commit to hpatro/valkey that referenced this pull request on Oct 3, 2025:
…ous reading of auth response (valkey-io#2494)
In the old SLOT_EXPORT_AUTHENTICATING state added in valkey-io#1949, the source node sends the AUTH command and then synchronously reads the response. If the target node is blocked during this process, the source node will also block. We should use a read handler to handle this, so we split SLOT_EXPORT_AUTHENTICATING into SLOT_EXPORT_SEND_AUTH and SLOT_EXPORT_READ_AUTH_RESPONSE to avoid this issue. Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]>
Labels
cluster
needs-doc-pr: This change needs to update a documentation page. Remove the label once the doc PR is open.
release-notes: This issue should get a line item in the release notes.
run-extra-tests: Run extra tests on this PR (runs all tests from daily except valgrind and RESP).
Introduces a new family of commands for migrating slots via replication. The procedure is driven by the source node which pushes an AOF formatted snapshot of the slots to the target, followed by a replication stream of changes on that slot (a la manual failover).
This solution is an adaptation of the solution provided by @enjoy-binbin, combined with the solution I previously posted at #1591, modified to meet the designs we had outlined in #23.
Closes #23.
Co-authored-by: Binbin [email protected]